10. Cross-Validation and Feature Importance

ND320 C4 L3 12 Nested Cross-Validation

Nested Cross Validation Summary

Ideally, we would pick the best hyperparameters on one subset of the data and then evaluate the model on a separate hold-out set, similar to a train-validation-test split. When there isn't enough data to split the dataset into three parts, we can instead nest the hyperparameter selection inside another layer of cross-validation — this is the Nested Cross Validation technique.

We then walked through how to apply this technique to our dataset. Our measured performance dropped, because we are no longer overfitting our hyperparameters to the same data we use to evaluate model performance.
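The two-layer structure described above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the course notebook's actual code: the dataset, parameter grid, and fold counts here are placeholder assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the real feature matrix and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner loop: GridSearchCV selects hyperparameters by cross-validation.
inner_model = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [2, 4]},  # placeholder grid
    cv=3,
)

# Outer loop: cross_val_score evaluates the *entire* selection procedure,
# so the reported scores are not inflated by hyperparameter overfitting.
outer_scores = cross_val_score(inner_model, X, y, cv=3)
mean_score = outer_scores.mean()
```

Because the hyperparameters are re-selected inside every outer fold, the outer scores estimate how the whole modeling pipeline generalizes, which is why they tend to be lower (and more honest) than scores from a single cross-validation loop.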

ND320 C4 L3 13 Feature Importance

Summary

We have just learned that another way to regularize our model and improve performance (besides limiting the tree depth) is to reduce the number of features we use. The RandomForestClassifier can tell us how important each feature is for classifying the data. We selected the 10 most important features according to the RandomForestClassifier and retrained the model on just those 10 features. The retrained model no longer misclassified bike as walk, and this improved our classifier's performance by 15%, just by picking the most important features!
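The feature-selection step above can be sketched as follows. This is a hedged example on synthetic data, assuming scikit-learn; the real notebook works with the activity-classification features, which are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 20 features, only a few of which are informative.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# feature_importances_ sums to 1; larger values mean the feature
# contributed more to the forest's split decisions.
top10 = np.argsort(clf.feature_importances_)[::-1][:10]

# Retrain using only the 10 most important features.
clf_top = RandomForestClassifier(n_estimators=100, random_state=0)
clf_top.fit(X[:, top10], y)
```

Dropping uninformative features acts as a regularizer: the trees can no longer split on noisy columns, which is the same effect the lesson observed when the pared-down model stopped confusing bike with walk.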